Abstract

Document page segmentation is a crucial preprocessing step in Optical Character Recognition
(OCR) systems. While numerous page segmentation algorithms have been proposed, there is
relatively little literature on comparative evaluation, empirical or theoretical, of these
algorithms. For the existing performance evaluation methods, two crucial components are
usually missing: 1) automatic training of algorithms with free parameters and 2) statistical
and error analysis of experimental results. In this thesis, we use the following five-step
methodology to quantitatively compare the performance of page segmentation algorithms:
1) first we create mutually exclusive training and test datasets with groundtruth, 2) we then
select a meaningful and computable performance metric, 3) an optimization procedure is then
used to search automatically for the optimal parameter values of the segmentation algorithms,
4) the segmentation algorithms are then evaluated on the test dataset, and finally 5) a
statistical error analysis is performed to give the statistical significance of the
experimental results. The automatic training of algorithms is posed as an optimization
problem and a direct search method, the simplex method, is used to search for a set of
optimal parameter values. A paired-model statistical analysis and an error analysis are
conducted to provide confidence intervals for the experimental results and to interpret the
functionalities of algorithms. This methodology is applied to the evaluation of five page
segmentation algorithms, of which three are representative research algorithms and the other
two are well-known commercial products, on 978 images from the University of Washington III
dataset. It is found that the performances of the Voronoi, Docstrum and Caere segmentation
algorithms are not significantly different from each other, but they are significantly better
than that of ScanSoft's segmentation algorithm, which in turn is significantly better than
that of X-Y cut.

This  research  was  funded  in  part  by  the  Department  of  Defense  and  the  Army  Research  Laboratory 
under  Contract  MDA  9049-6C-1250. 


1  Introduction 


Optical Character Recognition (OCR) is the automated process of translating an input
document image into a symbolic text file. The input document images can come from
a large variety of media such as journals, books, newspapers, magazines, microfilms,
personal notes, etc. They can be digitally created, faxed or scanned document images.
The format of a document image can be handwritten or machine printed. A document
can contain text, tables, figures and halftone images. The output symbolic text file from
an OCR system can include only the text content of the input document image, or it can
also include additional descriptive information such as page layout, font size and style,
document region type, confidence level for the recognized characters, etc.

Page segmentation is a crucial preprocessing step in an OCR system. It is the process
of dividing a document image into homogeneous zones, i.e., those zones that only contain
one type of information such as text, a table, a figure or a halftone image. In many
cases, OCR system accuracy heavily depends on the accuracy of the page segmentation
algorithm. While numerous page segmentation algorithms have been proposed in the
past, relatively little research effort has been devoted to the comparative evaluation,
empirical or theoretical, of these algorithms.

This report is organized as follows. In Section 2 we conduct a survey of related
literature. In Section 3 we provide the problem definition for page segmentation, error
measurements and a metric. In Section 4 we outline our five-step empirical performance
evaluation methodology. In Section 5, automatic algorithm training is posed as an
optimization problem and a simplex algorithm is described. In Section 6 our paired model
statistical analysis method is presented. In Section 7 the segmentation algorithms that
we evaluated are described. In Section 8 the experimental protocol for conducting the
training and testing experiments is presented. In Section 9 we report experimental
results and provide a detailed discussion. Finally, in Section 10 we give our conclusions.
We have reported part of the work presented in this thesis in Document Recognition and
Retrieval VII [23].

2  Literature  Survey 

Page segmentation algorithms can be categorized into three classes: top-down approaches,
bottom-up approaches and hybrid approaches. The Docstrum algorithm of O'Gorman
[28], the Voronoi-diagram-based algorithm of Kise [19], the run-length smearing
algorithm of Wahl, Wong and Casey [43], the segmentation algorithm of Jain and Yu [15],
and the text string separation algorithm of Fletcher and Kasturi [7] are typical
bottom-up algorithms, while the X-Y cut by Nagy [25, 26] and the shape-directed-covers-based
algorithm by Baird [2, 1] are top-down algorithms. Pavlidis and Zhou [30] proposed a
hybrid algorithm using a split-and-merge strategy. A survey of OCR and page
segmentation algorithms can be found in O'Gorman and Kasturi [29] and Jain and Yu [15]. A
recent workshop [5] was devoted to addressing issues related to page segmentation.

While many segmentation algorithms have been proposed in the literature, relatively
few researchers have addressed the issue of quantitative evaluation of segmentation
algorithms. Several page segmentation performance evaluation methods have been proposed
in the past. Kanai et al. [16] proposed a metric that is a weighted sum of the number
of edit operations (insertions, deletions and moves). The advantage of this method is
that it requires only ASCII text groundtruth and hence does not require zone or textline
bounding-box groundtruth. The limitations of this method are that it cannot specify the
error location in the image, it is dependent on the OCR engine's recognition accuracy,
and the metric cannot be computed for languages for which no OCR engine is available.
Rice, Jenkins and Nartker [38] used this performance metric in their comparative
evaluation of the automatic zoning accuracy of four commercial OCR products. Vincent
et al. [36, 37, 45] proposed various bitmap-level region-based metrics. The advantages
of the Vincent et al. approach are that it can evaluate both text regions and non-text
regions, it is independent of zone representation schemes, the errors can be localized and
categorized, and the performance metric can be customized by the users. A limitation of
this method is that the metric is dependent on pixel noise. Liang, Phillips and Haralick
[22] describe a region-area-based metric. The overlap area of a groundtruth zone and a
segmentation zone is used to compute this performance metric. Many OCR performance
evaluation case studies are discussed in [14].

In the general computer vision area, numerous researchers have presented methods
for empirical performance evaluation. For example, Hoover et al. [13] proposed an
experimental framework for quantitative comparison of range image segmentation algorithms
and demonstrated the methodology by evaluating four range segmentation algorithms.
Kanungo et al. [17] described a four-step methodology for the evaluation of two detection
algorithms. Phillips and Chhabra [32] presented a methodology for empirically
evaluating graphics recognition systems. These methodologies have not addressed the issues of
automatic training of algorithms with free parameters and statistical and error analysis
of experimental results. Phillips et al. [33] proposed the FERET evaluation methodology
for face recognition algorithms. However, the problem addressed there is only face
classification and in particular not face segmentation. A special issue of IEEE Transactions
on Pattern Analysis and Machine Intelligence (Vol. 21, 1999) was devoted to empirical
evaluation of computer vision algorithms. Two workshops have been devoted to empirical
evaluation techniques and methodologies in computer vision [3, 10].

In research segmentation algorithms that have user-specifiable parameters, typically
the default parameter values are selected and no training method is explicitly specified
[19, 2, 28, 15, 30, 7]. Similarly, in performance evaluation literature where the algorithm
parameters can be set by evaluators, a set of parameter values is usually selected
manually in the training procedure [13, 17, 32]. A common aspect of these parameter value
selection methods and training methods is that a set of "optimal parameter values" is
manually selected based on some assumption regarding the training dataset. To
objectively optimize a segmentation algorithm on a given training dataset, a set of optimal
parameter values should be automatically found by a training procedure. Automatic
training of any algorithm with free parameters is actually an optimization problem. In
the optimization area, there are a number of classes of optimization problems based on
the properties of the given objective function. An in-depth discussion and classification
of optimization problems can be found in Gill, Murray and Wright [8]. In our case,
since the objective function corresponding to a performance metric for page
segmentation algorithms is not rigorously defined mathematically, automatic training is posed
as a multivariate non-smooth nonlinear function optimization problem. Direct search
algorithms are typically used for solving optimization problems involving this kind of
objective function. Powell [34] gives a detailed survey of direct search algorithms. Line
search methods, discrete grid methods, simplex methods, conjugate direction methods,
linear approximation methods, and quadratic approximation methods are designed to
converge to a local minimum of the objective function. Additional survey literature
regarding direct search algorithms can be found in [21, 44]. Many practical optimization
problems have many local minima that are not globally optimal. "Simulated annealing" [20]
and "genetic" [9] algorithms have been proposed to search for a global minimum by selecting
vectors of variables using random number generators. We chose the simplex search method
proposed by Nelder and Mead [27] since it is recognized as one of the most reliable and
efficient methods [4, 41] for optimizing an objective function for which derivatives are
not available.

3  The  Page  Segmentation  Problem  and  Error  Metrics 

In this section, we give the definition of page segmentation. In order to evaluate the
performance of page segmentation algorithms, a set of error measurements and metrics
is needed. We provide the definitions of our proposed textline-based error measures
and metric. These definitions are based on set theory and mathematical morphology
[42, 24, 12].

3.1  Page  Segmentation  Definition 

Let $I$ be a document image, and let $G$ be the groundtruth of $I$. Let
$Z(G) = \{Z_q^G,\ q = 1, 2, \ldots, \#Z(G)\}$ be a set of groundtruth zones of document
image $I$, where $\#$ denotes the cardinality of a set. Let
$L(Z_q^G) = \{l_{qj}^G,\ j = 1, 2, \ldots, \#L(Z_q^G)\}$ be the set of groundtruth
textlines in groundtruth zone $Z_q^G$. Let the set of all groundtruth textlines in
document image $I$ be $\mathcal{L} = \bigcup_{q=1}^{\#Z(G)} L(Z_q^G)$. Let $A$ be a given
segmentation algorithm with parameter vector $\mathbf{p}^A$, and let $Seg_A(\cdot,\cdot)$ be
the segmentation function corresponding to algorithm $A$. Let $R$ be the segmentation
result of algorithm $A$ such that $R = Seg_A(I, \mathbf{p}^A)$ where
$Z(R) = \{Z_k^R \mid k = 1, 2, \ldots, \#Z(R)\}$.

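Let $D(\cdot) \subseteq \mathbb{Z}^2$ be the domain of its argument. The groundtruth zones
and textlines have the following properties:

1. $D(Z_q^G) \cap D(Z_{q'}^G) = \emptyset$ for $Z_q^G, Z_{q'}^G \in Z(G)$ and $q \neq q'$, and

2. $D(l_i^G) \cap D(l_{i'}^G) = \emptyset$ for $l_i^G, l_{i'}^G \in \mathcal{L}$ and $i \neq i'$.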

In  our  evaluation  method,  we  evaluate  deskewed  document  images  with  rectangular 
zones  and  textline  groundtruth.  Some  groundtruth  generation  methods  provide  only 
zone-level  groundtruth.  In  our  evaluation  methodology,  we  need  both  zone-level  and 
textline-level groundtruth.

3.2  Error  Measurements  and  Metric  Definitions

A meaningful and computable performance metric is essential for evaluating page
segmentation algorithms quantitatively. While a performance metric is typically not unique, and
researchers can select a particular performance metric to study certain aspects of page
segmentation algorithms, a set of error measurements is necessary. Let
$T_X, T_Y \in \mathbb{Z}^+ \cup \{0\}$ be two length thresholds (in number of pixels) that
determine if the overlap is significant or not. Let
$E(T_X, T_Y) = \{e \in \mathbb{Z}^2 \mid -T_X \le X(e) \le T_X,\ -T_Y \le Y(e) \le T_Y\}$ be
a rectangular region centered at $(0, 0)$ with a width of $2T_X$ and a height of $2T_Y$, where
$X(\cdot)$ and $Y(\cdot)$ denote the X and Y coordinates of the argument respectively. We now
define two morphological operations: dilation and erosion [42, 24, 12]. Let
$A, B \subseteq \mathbb{Z}^2$. Morphological dilation of $A$ by $B$ is denoted by
$A \oplus B$ and is defined as

$$A \oplus B = \{c \in \mathbb{Z}^2 \mid c = a + b \text{ for some } a \in A,\ b \in B\}.$$

Morphological erosion of $A$ by $B$ is denoted by $A \ominus B$ and is defined as

$$A \ominus B = \{c \in \mathbb{Z}^2 \mid c + b \in A \text{ for every } b \in B\}.$$
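The two operations can be tried out on small pixel sets. The following is a minimal sketch, assuming regions are stored as Python sets of integer (x, y) tuples; the helper names `dilate`, `erode`, `rect_element` and `box` are illustrative and do not come from the report.

```python
def dilate(a, b):
    """Dilation: every c = x + y with x in a and y in b."""
    return {(ax + bx, ay + by) for (ax, ay) in a for (bx, by) in b}


def erode(a, b):
    """Erosion: every c such that c + y lies in a for all y in b."""
    if not b:
        raise ValueError("structuring element must be non-empty")
    # Any valid c satisfies c + b0 in a, so candidates can be derived from a.
    bx0, by0 = next(iter(b))
    candidates = {(ax - bx0, ay - by0) for (ax, ay) in a}
    return {
        (cx, cy)
        for (cx, cy) in candidates
        if all((cx + bx, cy + by) in a for (bx, by) in b)
    }


def rect_element(tx, ty):
    """The rectangle E(tx, ty) centered at the origin (inclusive bounds)."""
    return {(x, y) for x in range(-tx, tx + 1) for y in range(-ty, ty + 1)}


def box(x0, y0, x1, y1):
    """All pixels of an inclusive axis-aligned bounding box."""
    return {(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)}


# A 10 x 5 textline box eroded by E(1, 1) loses a one-pixel border.
textline = box(0, 0, 9, 4)
core = erode(textline, rect_element(1, 1))
```

Eroding a textline before testing it against segmentation zones is what makes the thresholds act as a tolerance: a zone that clips at most a border of that width still contains the eroded core.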

We  first  define  correctly  detected  groundtruth  textlines,  and  then  define  four  types  of 
textline-based  error  measurements. 


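1. Groundtruth textlines that are correctly detected:

$$D_L = \{l^G \in \mathcal{L} \mid D(l^G) \ominus E(T_X, T_Y) \subseteq D(Z^R) \text{ for some } Z^R \in Z(R)\},$$

2. Groundtruth textlines that are missed:

$$C_L = \Bigl\{l^G \in \mathcal{L} \Bigm| D(l^G) \ominus E(T_X, T_Y) \subseteq \Bigl(\bigcup_{Z^R \in Z(R)} D(Z^R)\Bigr)^c\Bigr\},$$

3. Groundtruth textlines whose bounding boxes are split:

$$S_L = \{l^G \in \mathcal{L} \mid (D(l^G) \ominus E(T_X, T_Y)) \cap D(Z^R) \neq \emptyset,\ (D(l^G) \ominus E(T_X, T_Y)) \cap (D(Z^R))^c \neq \emptyset \text{ for some } Z^R \in Z(R)\},$$

4. Groundtruth textlines that are horizontally merged:

$$\begin{aligned}
M_L = \{\, l_{qj}^G \in \mathcal{L} \mid\ & \exists\, l_{q'j'}^G \in \mathcal{L},\ Z^R \in Z(R),\ q \neq q',\ Z_q^G, Z_{q'}^G \in Z(G) \text{ such that} \\
& (D(l_{qj}^G) \ominus E(T_X, T_Y)) \cap D(Z^R) \neq \emptyset,\ (D(l_{q'j'}^G) \ominus E(T_X, T_Y)) \cap D(Z^R) \neq \emptyset, \\
& ((D(l_{qj}^G) \ominus E(0, T_Y)) \oplus E(\infty, 0)) \cap D(Z_{q'}^G) \neq \emptyset, \\
& ((D(l_{q'j'}^G) \ominus E(0, T_Y)) \oplus E(\infty, 0)) \cap D(Z_q^G) \neq \emptyset \,\},
\end{aligned}$$

5. Noise zones that are falsely detected (false alarms):

$$F_L = \Bigl\{Z^R \in Z(R) \Bigm| D(Z^R) \subseteq \Bigl(\bigcup_{l^G \in \mathcal{L}} (D(l^G) \ominus E(T_X, T_Y))\Bigr)^c\Bigr\}.$$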

Figure  1  shows  an  example  of  errors  in  groundtruth  textlines. 

Let the number of groundtruth error textlines be $\#\{C_L \cup S_L \cup M_L\}$ (mis-detected,
split or horizontally merged), and let the total number of groundtruth textlines be
$\#\mathcal{L}$. We define the performance metric $\rho(I, G, R)$ as textline accuracy:

$$\rho(I, G, R) = \frac{\#\mathcal{L} - \#\{C_L \cup S_L \cup M_L\}}{\#\mathcal{L}}. \tag{1}$$

Figure 1: This figure shows examples of textline errors. (a) shows a groundtruth textline
split into two segmentation zones. (b) shows two groundtruth zones horizontally merged
into one segmentation zone. Only the dark groundtruth textlines are considered
horizontally merged since they are the only textlines that are impacted by the horizontal
merge. (c) shows two textlines on which multiple errors happen: each is both split and
merged by segmentation zones. In our metric we count two instances of textline error.


In general, the performance metric $\rho(I, G, R)$ can be any function of $\mathcal{L}$,
$D_L$, $S_L$, $M_L$ and $F_L$. Figure 2 gives a set of possible errors as well as an
experimental example.

We consider three types of textline errors: split, missed and horizontally merged.
We see that this textline-based performance metric has the following features: 1) it is
rigorously defined using set theory and mathematical morphology, 2) it is independent
of zone shape, 3) it is independent of OCR recognition error, 4) it ignores the
background information (white space, salt and pepper noise, etc.), 5) segmentation errors
can be localized, and 6) quantitative evaluation of lower level (e.g. textline, word and
character) segmentation algorithms can be readily achieved with little modification. This
performance metric, however, requires textline-level groundtruth.
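The textline accuracy of Equation (1) can be sketched for the common case of rectangular regions. This is a simplified illustration, not the authors' implementation: it assumes every zone and textline is an inclusive pixel rectangle `(x0, y0, x1, y1)`, that textlines are larger than the thresholds, and that groundtruth is given as a list of `(zone_rect, [textline_rects])` pairs. All function names are hypothetical.

```python
def shrink(rect, tx, ty):
    """Erode a rectangle by E(tx, ty)."""
    x0, y0, x1, y1 = rect
    return (x0 + tx, y0 + ty, x1 - tx, y1 - ty)


def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def contains(outer, inner):
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def rows_overlap(line, zone, ty):
    """True if the line, eroded vertically by ty and stretched horizontally
    without limit, meets the zone."""
    return line[1] + ty <= zone[3] and zone[1] <= line[3] - ty


def textline_accuracy(gt_zones, seg_zones, tx=0, ty=0):
    """Fraction of groundtruth textlines that are not missed, split or merged."""
    lines = [(q, line) for q, (_, zone_lines) in enumerate(gt_zones)
             for line in zone_lines]
    n_errors = 0
    for q, line in lines:
        core = shrink(line, tx, ty)
        hits = [z for z in seg_zones if intersects(core, z)]
        missed = not hits                                  # no zone touches it
        split = any(not contains(z, core) for z in hits)   # partly outside a zone
        merged = any(                                      # shares a zone with a
            q2 != q                                        # line of another column
            and any(intersects(shrink(other, tx, ty), z) for z in hits)
            and rows_overlap(line, gt_zones[q2][0], ty)
            and rows_overlap(other, gt_zones[q][0], ty)
            for q2, other in lines
        )
        if missed or split or merged:
            n_errors += 1                                  # counted once per line
    return (len(lines) - n_errors) / len(lines)


# Two columns with two textlines each.
gt = [
    ((0, 0, 49, 29), [(0, 0, 49, 9), (0, 20, 49, 29)]),
    ((60, 0, 109, 29), [(60, 0, 109, 9), (60, 20, 109, 29)]),
]
```

A textline that suffers several error types at once still contributes a single error, which mirrors the set union in the numerator of Equation (1).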

4  Performance  Evaluation  Methodology 

We  now  introduce  a  five-step  methodology.  In  this  methodology,  we  identify  three  crucial 
components:  automatic  training,  statistical  analysis,  and  error  analysis. 

A large and representative dataset is desirable in any performance evaluation task
in order to give objective performance measurements of the algorithms. A typical page
segmentation algorithm has a set of parameters that affect its performance. The
performance index is usually a user-defined performance metric that measures an aspect of the
algorithm that the user is interested in. In order to evaluate a page segmentation
algorithm on a specific dataset, a set of optimum parameters has to be used. The optimum
parameter set is a function of the given dataset, the groundtruth, and the performance
metric. The set of optimum parameters for one dataset may be a non-optimal parameter
set for another dataset. Hence the choice of parameters is crucial in any performance
evaluation task. When the size of the dataset gets very large, parameter set training
on the whole dataset becomes computationally prohibitive and therefore a representative
sample dataset of much smaller size must be used as a training dataset.

Figure 2: (a) This figure shows a set of possible textline errors. Solid line rectangles
denote groundtruth zones, dashed-line rectangles denote OCR segmentation zones, dark
bars within groundtruth zones denote groundtruth textlines, and dark bars outside solid
lines are noise blocks. (b) This figure shows a document page image from the University
of Washington III dataset with the groundtruth zones overlaid. (c) This figure shows an
OCR experimental segmentation result on this document page image. (d) This figure
shows segmentation error textlines. Notice that there are two horizontally merged zones
just below the caption and two horizontally merged zones in the middle of the text body.
In OCR output, horizontally split zones cause reading order errors whereas vertically
split zones do not cause such errors.

After the training step, the page segmentation algorithms with the optimal parameters
should be evaluated on a test dataset that is different from the training set. Finally, in
order to interpret the significance of the experimental results, a statistical error
analysis should be performed. Let $\mathcal{D}$ be a given dataset containing document image
and groundtruth pairs $(I, G)$. The steps in our methodology for evaluating page
segmentation algorithms are as follows:

1. Randomly partition the dataset $\mathcal{D}$ into a mutually exclusive training dataset
$\mathcal{T}$ and test dataset $\mathcal{S}$. Thus $\mathcal{D} = \mathcal{T} \cup \mathcal{S}$
and $\mathcal{T} \cap \mathcal{S} = \emptyset$, where $\emptyset$ is the empty dataset.

2. Define a meaningful and computable performance metric $\rho(I, G, R)$ where $I$ is a
document image, $G$ is the groundtruth of $I$, and $R$ is the segmentation result on $I$.

3. For a selected segmentation algorithm $A$, specify its parameter vector $\mathbf{p}^A$ and
automatically find the optimal parameter setting $\hat{\mathbf{p}}^A$ for which an objective
function $f(\mathbf{p}^A; \mathcal{T}, \rho, A)$ assumes the "best" measure on the training
dataset $\mathcal{T}$.

4. Evaluate the segmentation algorithm $A$ with optimized parameters $\hat{\mathbf{p}}^A$ on
the test dataset $\mathcal{S}$ by
$\Phi\bigl(\{\rho(I, G, Seg_A(I, \hat{\mathbf{p}}^A)) \mid (I, G) \in \mathcal{S}\}\bigr)$,
where $\Phi$ is a function of the performance metric $\rho$ on each document image and
groundtruth pair $(I, G)$ in the test dataset and $Seg_A(\cdot,\cdot)$ is the segmentation
function corresponding to $A$. The function $\Phi$ is defined by the user. In our case,
$\Phi$ is defined as the average of the performance metric
$\rho(I, G, Seg_A(I, \hat{\mathbf{p}}^A))$ on each document image and groundtruth pair
$(I, G)$ in the test dataset $\mathcal{S}$.

5. Perform a statistical analysis to find the significance of the evaluation results and
identify or hypothesize why the algorithms perform at the respective levels.

The  above  methodology  can  be  applied  to  any  segmentation  algorithm  that  has  free 
parameters.  If  the  algorithm  does  not  have  free  parameters,  as  is  the  case  with  many 
commercial  algorithms,  we  do  not  perform  the  training  step. 
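Steps 1, 3 and 4 of the methodology can be sketched as a small driver. The code below is only an outline under stated assumptions: the dataset is a list of `(image, groundtruth)` pairs, and `segment`, `metric` and `train` are caller-supplied callables whose names and signatures are hypothetical. Passing `train=None` with `fixed_params` corresponds to an algorithm without free parameters, for which the training step is skipped.

```python
import random


def partition(dataset, n_train, seed=0):
    """Step 1: random split into disjoint training and test sets."""
    items = list(dataset)
    random.Random(seed).shuffle(items)
    return items[:n_train], items[n_train:]


def mean_metric(segment, metric, params, dataset):
    """Step 4 with Phi chosen as the average of the per-page metric."""
    scores = [metric(image, truth, segment(image, params))
              for image, truth in dataset]
    return sum(scores) / len(scores)


def run_protocol(dataset, n_train, segment, metric,
                 train=None, fixed_params=None, seed=0):
    train_set, test_set = partition(dataset, n_train, seed)
    # Step 3: training is skipped when the algorithm has no free parameters.
    params = train(train_set) if train is not None else fixed_params
    return params, mean_metric(segment, metric, params, test_set)


# Toy stand-ins: an "image" is a number and the correct output is its double.
toy_data = [(i, 2 * i) for i in range(1, 11)]
toy_segment = lambda image, p: image * p
toy_metric = lambda image, truth, result: 1.0 if result == truth else 0.0


def toy_train(train_set):
    # Exhaustive search over three candidate parameter values.
    return max([1, 2, 3],
               key=lambda p: mean_metric(toy_segment, toy_metric, p, train_set))


best_params, test_accuracy = run_protocol(
    toy_data, 4, toy_segment, toy_metric, train=toy_train)
```

Keeping the split inside the driver makes it harder to tune parameters on pages that are later used for testing.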

This methodology is similar to typical methodologies used in pattern recognition. In
pattern recognition, problems are usually well-defined mathematically and hence a better
training strategy and optimization method can be used than in our case, where page
segmentation algorithms are not rigorously defined mathematically. In the computer vision
and image processing literature, Kanungo et al. [17] conducted a quantitative
performance evaluation of two detection algorithms. Hoover et al. [13] quantitatively compared
four range image algorithms. In both of these papers, while a detailed methodology and
experimental framework were carefully designed, two important components, automatic
algorithm training and statistical analysis of the experimental results, were missing.

5  Automatic  Algorithm  Training:  The  Optimization  Problem 

Any automatic training or learning problem can be posed as an optimization problem.
An optimization problem has three components: the objective function that gives a single
measure, a set of parameters that the objective function is dependent on, and a
parameter subspace that defines acceptable or reasonable parameter values. The acceptable or
reasonable parameter subspace defines the constraints on the optimization problem. The
purpose of an optimization procedure is to find a set of parameter values for which the
objective function gives the "best" (minimum or maximum) measure values. In this section,
we first define the objective function in our performance evaluation of page segmentation
algorithms, then we introduce a direct search algorithm to optimize the defined objective
function, and finally we discuss starting point selection in our optimization problem.


5.1  The  Objective  Function 

In this subsection, we identify the objective function. Let $\mathbf{p}^A$ be the parameter
vector for the segmentation algorithm $A$, let $\mathcal{T}$ be a training dataset, and let
$\rho(I, G, Seg_A(I, \mathbf{p}^A))$, where $(I, G) \in \mathcal{T}$, be a performance
metric. We define the objective function $f(\mathbf{p}^A; \mathcal{T}, A, \rho)$ to be
minimized as the average textline error rate on the training dataset:

$$f(\mathbf{p}^A; \mathcal{T}, A, \rho) = \frac{1}{\#\mathcal{T}} \Biggl[ \sum_{(I,G) \in \mathcal{T}} \bigl(1 - \rho(I, G, Seg_A(I, \mathbf{p}^A))\bigr) \Biggr] \tag{2}$$

where $\rho$ is defined in Equation (1).

This  objective  function  has  the  following  properties: 

•  It  is  dependent  on  the  values  of  the  algorithm  parameters, 

•  The  function  value  is  the  only  information  available, 

•  The  function  has  no  explicit  mathematical  form  and  is  non-differentiable,

•  Obtaining  a  function  value  requires  nontrivial  computation. 

This  objective  function  can  be  classified  as  a  multivariate  non-smooth  function  [8].  In 
the  following  section,  we  describe  an  optimization  algorithm  to  minimize  this  objective 
function. 
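Equation (2) can be wrapped as a callable that a direct search method evaluates repeatedly. The sketch below is illustrative only: `segment` and `metric` are hypothetical callables, and the memoization is an added convenience (not something the report describes) motivated by the observation that each function value requires nontrivial computation.

```python
def make_objective(segment, metric, training_set):
    """Return f(params) = average of (1 - metric) over the training set."""
    cache = {}

    def objective(params):
        key = tuple(params)
        if key not in cache:
            # One full segmentation pass over the training set per new point.
            errors = [1.0 - metric(image, truth, segment(image, params))
                      for image, truth in training_set]
            cache[key] = sum(errors) / len(errors)
        return cache[key]

    return objective, cache


# Toy stand-ins: the "segmentation" multiplies the image by the parameter.
calls = []


def toy_segment(image, params):
    calls.append((image, tuple(params)))
    return image * params[0]


toy_metric = lambda image, truth, result: 1.0 if result == truth else 0.0
toy_training_set = [(1, 2), (2, 4), (3, 7)]

objective, cache = make_objective(toy_segment, toy_metric, toy_training_set)
first = objective((2,))
second = objective((2,))   # served from the cache, no new segmentation runs
```

Only function values are exposed to the optimizer, which matches the properties listed above: no gradient and no closed form are assumed.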


5.2  The  Simplex  Search  Method 

Direct search methods are typically used to solve the optimization problem described in
Section 5.1. We choose the simplex search method proposed by Nelder and Mead [27] to
minimize our objective function.

We give the notation used to describe the simplex method: Let $\mathbf{q}_0$ and
$\lambda_i$, $i = 1, \ldots, n$ be a starting point and a set of scales, let $\mathbf{e}_i$,
$i = 1, \ldots, n$ be $n$ orthogonal unit vectors in $n$-dimensional parameter space, let
$\mathbf{p}_0, \ldots, \mathbf{p}_n$ be $(n + 1)$ ordered points in $n$-dimensional
parameter space such that their corresponding function values satisfy
$f_0 \le f_1 \le \ldots \le f_n$, let $\bar{\mathbf{p}} = \sum_{i=0}^{n-1} \mathbf{p}_i / n$
be the centroid of the $n$ best (smallest) points, let $[\mathbf{p}_i \mathbf{p}_j]$ be the
$n$-dimensional Euclidean distance from $\mathbf{p}_i$ to $\mathbf{p}_j$, let $\alpha$,
$\beta$, $\gamma$ and $\sigma$ be the reflection, contraction, expansion and shrinkage
coefficients, respectively, and let $\tau$ be the threshold for the stopping criterion. We
use the standard choice for the coefficients: $\alpha = 1$, $\beta = 0.5$, $\gamma = 2$,
$\sigma = 0.5$. We set $\tau$ to $10^{-6}$. Figure 3 shows the various simplex operations.

Figure 3: This figure shows four simplex operations in a two-dimensional parameter space.
The solid lines denote the simplex before any operation and the dashed lines denote the
simplex after the operation. $\mathbf{p}_2$ and $\mathbf{p}_0$ are the vertices for which the
objective function $f(\cdot)$ assumes the biggest and smallest values respectively, and
$\bar{\mathbf{p}} = \sum_{i=0}^{1} \mathbf{p}_i / 2$ is the centroid of the two best
vertices. The operations are (a) a reflection $\mathbf{p}_r$ of $\mathbf{p}_2$ with respect
to the centroid point $\bar{\mathbf{p}}$, (b) an expansion $\mathbf{p}_e$ of $\mathbf{p}_2$
with respect to the centroid point $\bar{\mathbf{p}}$, (c) a contraction $\mathbf{p}_c$ of
$\mathbf{p}_2$ with respect to the centroid point $\bar{\mathbf{p}}$, and (d) a shrinkage of
all $\mathbf{p}_i$, $i \neq 0$ toward $\mathbf{p}_0$. A local minimum can be obtained after
an appropriate sequence of such operations.

For a segmentation algorithm with $n$ parameters, the Nelder-Mead algorithm works
as follows:

1. Given $\mathbf{q}_0$ and the $\lambda_i$, form the initial simplex
$\{\mathbf{q}_0, \mathbf{q}_1, \ldots, \mathbf{q}_n\}$ with
$\mathbf{q}_i = \mathbf{q}_0 + \lambda_i \mathbf{e}_i$, $i = 1, \ldots, n$.

2. Relabel the $n + 1$ vertices as $\mathbf{p}_0, \ldots, \mathbf{p}_n$ with
$f(\mathbf{p}_0) \le f(\mathbf{p}_1) \le \ldots \le f(\mathbf{p}_n)$.

3. Get a reflection point $\mathbf{p}_r$ of $\mathbf{p}_n$ by
$\mathbf{p}_r = (1 + \alpha)\bar{\mathbf{p}} - \alpha \mathbf{p}_n$ where
$\alpha = [\mathbf{p}_r \bar{\mathbf{p}}] / [\mathbf{p}_n \bar{\mathbf{p}}]$.

4. If $f(\mathbf{p}_r) < f(\mathbf{p}_0)$, replace $\mathbf{p}_n$ by $\mathbf{p}_r$, get an
expansion point $\mathbf{p}_e$ of $\mathbf{p}_n$ by
$\mathbf{p}_e = (1 - \gamma)\bar{\mathbf{p}} + \gamma \mathbf{p}_n$ where
$\gamma = [\mathbf{p}_e \bar{\mathbf{p}}] / [\mathbf{p}_n \bar{\mathbf{p}}] > 1$.

5. Else if $f(\mathbf{p}_r) > f(\mathbf{p}_{n-1})$: if $f(\mathbf{p}_r) < f(\mathbf{p}_n)$,
replace $\mathbf{p}_n$ by $\mathbf{p}_r$; get a contraction point $\mathbf{p}_c$ of
$\mathbf{p}_n$ by $\mathbf{p}_c = (1 - \beta)\bar{\mathbf{p}} + \beta \mathbf{p}_n$ where
$\beta = [\mathbf{p}_c \bar{\mathbf{p}}] / [\mathbf{p}_n \bar{\mathbf{p}}] < 1$. If
$f(\mathbf{p}_c) > f(\mathbf{p}_n)$, shrink the simplex around the best vertex
$\mathbf{p}_0$ by $\mathbf{p}_i = (\mathbf{p}_i + \mathbf{p}_0)\sigma$, $i \neq 0$, else
replace $\mathbf{p}_n$ by $\mathbf{p}_c$.

6. Else replace $\mathbf{p}_n$ by $\mathbf{p}_r$.

7. If $\sqrt{\sum_{i=0}^{n} (f(\mathbf{p}_i) - f(\bar{\mathbf{p}}))^2 / n} < \tau$, stop.

8. Else go back to step 2.
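The steps above can be sketched in plain Python. This is a textbook-style Nelder-Mead variant written for illustration, not a transcription of the authors' code: in particular, it keeps the expansion point only when it improves on the reflection point, and it writes the shrink step as a move toward the best vertex, which coincides with the formula in step 5 when the shrinkage coefficient is 0.5. The function name and signature are assumptions.

```python
import math


def nelder_mead(f, q0, scales, alpha=1.0, beta=0.5, gamma=2.0, sigma=0.5,
                tol=1e-6, max_iter=1000):
    n = len(q0)
    # Step 1: initial simplex q0, q0 + scale_i * e_i.
    simplex = [list(q0)] + [
        [q0[k] + (scales[i] if k == i else 0.0) for k in range(n)]
        for i in range(n)
    ]
    values = [f(p) for p in simplex]
    for _ in range(max_iter):
        # Step 2: order the vertices from best to worst.
        order = sorted(range(n + 1), key=values.__getitem__)
        simplex = [simplex[i] for i in order]
        values = [values[i] for i in order]
        centroid = [sum(p[k] for p in simplex[:-1]) / n for k in range(n)]
        # Step 7: stop when vertex values are close to the centroid value.
        f_centroid = f(centroid)
        if math.sqrt(sum((v - f_centroid) ** 2 for v in values) / n) < tol:
            break
        worst = simplex[-1]
        # Step 3: reflect the worst vertex through the centroid.
        refl = [(1 + alpha) * c - alpha * w for c, w in zip(centroid, worst)]
        f_refl = f(refl)
        if f_refl < values[0]:
            # Step 4: try to expand further along the same direction.
            expd = [(1 - gamma) * c + gamma * r for c, r in zip(centroid, refl)]
            f_expd = f(expd)
            if f_expd < f_refl:
                simplex[-1], values[-1] = expd, f_expd
            else:
                simplex[-1], values[-1] = refl, f_refl
        elif f_refl < values[-2]:
            # Step 6: plain reflection is good enough.
            simplex[-1], values[-1] = refl, f_refl
        else:
            # Step 5: contract, or shrink if contraction does not help.
            if f_refl < values[-1]:
                simplex[-1], values[-1] = refl, f_refl
                worst = refl
            contr = [(1 - beta) * c + beta * w for c, w in zip(centroid, worst)]
            f_contr = f(contr)
            if f_contr < values[-1]:
                simplex[-1], values[-1] = contr, f_contr
            else:
                best = simplex[0]
                simplex = [best] + [
                    [b + sigma * (x - b) for x, b in zip(p, best)]
                    for p in simplex[1:]
                ]
                values = [values[0]] + [f(p) for p in simplex[1:]]
    best_index = min(range(n + 1), key=values.__getitem__)
    return simplex[best_index], values[best_index]


# Smooth test function with its minimum at (1, -2).
bowl = lambda p: (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2
point, value = nelder_mead(bowl, [0.0, 0.0], [1.0, 1.0])
```

The search uses only function values, so the same routine can be handed a segmentation error rate in place of the toy function.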

5.3  Starting  Point  Selection 

The objective function corresponding to each segmentation algorithm need not have a
unique minimum. Furthermore, direct search optimization algorithms are local
optimization algorithms. Thus, for each (different) starting point, the optimization algorithm
could converge to a different optimal solution. We constrain the parameter values to lie
within a reasonable range and randomly choose six starting locations within this range.
The optimal solution corresponding to the lowest optimal value is chosen as the best
optimal parameter vector.
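The restart strategy described above can be sketched as follows. The `local_search` argument stands for any local optimizer that takes an objective and a starting point and returns a `(point, value)` pair; that interface, the uniform sampling and the function name `multistart` are assumptions made for this illustration.

```python
import random


def multistart(local_search, objective, bounds, n_starts=6, seed=0):
    """Run a local search from random starts and keep the lowest result."""
    rng = random.Random(seed)
    best_point, best_value = None, float("inf")
    for _ in range(n_starts):
        # Draw each coordinate uniformly inside its allowed range.
        start = [rng.uniform(low, high) for low, high in bounds]
        point, value = local_search(objective, start)
        if value < best_value:
            best_point, best_value = point, value
    return best_point, best_value


# Stand-in local search: snaps to the nearest integer, so different starts
# end in different "local optima".
seen_starts = []


def rounding_search(objective, start):
    seen_starts.append(start)
    point = [round(start[0])]
    return point, objective(point)


toy_objective = lambda p: (p[0] - 3) ** 2
best_point, best_value = multistart(rounding_search, toy_objective, [(0.0, 10.0)])
```

Six starts is the count used in the text; raising `n_starts` trades more computation for a better chance of reaching a lower minimum.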

6  Statistical  Analysis:  A  Paired  Model  Approach 

In comparative performance evaluation frameworks, statistical analysis plays a crucial
role in objectively interpreting the experimental results. In our experiments, we compare
the performance metric values (average textline accuracy) of page segmentation
algorithms against each other. In doing so, some basic questions are immediately raised: 1)
If the performance metric of one algorithm is better than that of another algorithm, is the
result statistically significant? 2) What is the uncertainty in the estimated performance
metric? 3) Are the algorithms in one class performing significantly better than those in
another class? 4) What are the sources of performance metric variance? To answer such
questions, a statistical model needs to be constructed for the experimental observations.

In this section, we describe a paired model analysis approach proposed by Kanungo
et al. [18] for their evaluation of Arabic OCR engines and adapt it to analyze our
experimental results. We use the paired model to model our experimental observations
and to provide underlying theory for creating confidence intervals and testing hypotheses.

6.1  Modeling  Experimental  Data  Using  the  Paired  Model

In this section, we set up the notation and fit the paired model to our experimental
observation data. Let $A_1, A_2, \ldots, A_k$ denote the $k$ algorithms we evaluate, and let
$X_{ij}$, $i = 1, \ldots, k$, $j = 1, \ldots, n$ be the observation (textline accuracy)
corresponding to algorithm $A_i$ and document image $I_j$ in test dataset $\mathcal{S}$. In
our experiment, the number of algorithms $k$ is 5 and the total number of images $n$ in test
dataset $\mathcal{S}$ is 878. We assume that the observations from different images are
statistically independent, i.e., that $X_{ij}$ and $X_{i'j'}$ are independent when
$j \neq j'$. We also assume for a fixed algorithm that observations $X_{ij}$,
$j = 1, \ldots, n$, are iid random variables with finite mean $\mu_i$ and finite variance
$\sigma_i^2$. However, the observations $X_{ij}$ and $X_{i'j}$ corresponding to two different
algorithms, i.e., $i \neq i'$, on the same image $I_j$ are statistically dependent since the
two algorithms use the same image as input. We assume that the correlation coefficient
$\rho_{ii'}$ of observations of algorithms $A_i$ and $A_{i'}$ on the same page is constant.
This $\rho_{ii'}$ is positive since a document image that causes an algorithm to generate a
bad performance metric generally will also cause other algorithms to generate bad performance
metrics. Let $\mathrm{cov}(X_{ij}, X_{i'j}) = \rho_{ii'} \sigma_i \sigma_{i'}$, where
$i \neq i'$, be the covariance of observations $X_{ij}$ and $X_{i'j}$.

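Now construct a new random variable $W_{ii'j} = X_{ij} - X_{i'j}$, $i \neq i'$, where $W_{ii'j}$ and
$W_{ii'j'}$ are independent for $j \neq j'$. Based on our assumptions, it is easily seen that the
$W_{ii'j}$ are iid random variables for fixed $i$ and $i'$. Let $\bar{W}_{ii'}$ and $V_{ii'}^2$ be the sample mean
and sample variance of $W_{ii'j}$, and let $\Delta_{ii'}$ be the true mean difference such that
$\Delta_{ii'} = \mu_i - \mu_{i'}$. An unbiased estimator of $\Delta_{ii'}$ is
$\hat{\Delta}_{ii'} = \bar{W}_{ii'} = \bar{X}_i - \bar{X}_{i'}$ since

$$E[\hat{\Delta}_{ii'}] = E[\bar{W}_{ii'}] = E[\bar{X}_i - \bar{X}_{i'}] = \mu_i - \mu_{i'} = \Delta_{ii'}. \qquad (3)$$

The variance of the estimator $\hat{\Delta}_{ii'}$ is

$$\mathrm{Var}[\hat{\Delta}_{ii'}] = \mathrm{Var}[\bar{W}_{ii'}] = \mathrm{Var}[\bar{X}_i - \bar{X}_{i'}] = \frac{\sigma_i^2 + \sigma_{i'}^2 - 2\rho_{ii'}\sigma_i\sigma_{i'}}{n}. \qquad (4)$$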

6.2  Confidence  Intervals  and  Hypothesis  Testing 

In this section we address two important issues: i) How does one characterize the uncertainty
of the performance metric estimates? and ii) If the average textline accuracy of one algorithm
is better than another, how do we verify that the result is statistically significant
and not just due to chance? Let us first address the issue of uncertainty in performance
estimates. Since the $W_{ii'j}$ are iid random variables for fixed $i$ and $i'$, where $i \neq i'$, by the
Central Limit Theorem we have

$$\lim_{n \to \infty} \frac{\hat{\Delta}_{ii'} - \Delta_{ii'}}{\sigma_{ii'}/\sqrt{n}} = \lim_{n \to \infty} \frac{\bar{W}_{ii'} - (\mu_i - \mu_{i'})}{\sigma_{ii'}/\sqrt{n}} \sim N(0, 1), \qquad (5)$$


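where $\sigma_{ii'}$ is the true standard deviation of $W_{ii'j}$. For large samples, it is sufficient for us
to assume that

$$\frac{\hat{\Delta}_{ii'} - \Delta_{ii'}}{\sigma_{ii'}/\sqrt{n}} = \frac{\bar{W}_{ii'} - (\mu_i - \mu_{i'})}{\sigma_{ii'}/\sqrt{n}} \sim N(0, 1). \qquad (6)$$

When $\sigma_{ii'}$ is not available, as is the case here, the sample standard deviation $V_{ii'}$ is
typically used in place of $\sigma_{ii'}$ on the left side of Equation (6). The new formula has an
approximate t distribution with $n - 1$ degrees of freedom, i.e.,

$$\frac{\hat{\Delta}_{ii'} - \Delta_{ii'}}{V_{ii'}/\sqrt{n}} \sim t_{n-1}. \qquad (7)$$

Thus, for a given significance level $\alpha$, we can compute a confidence interval as

$$\Delta_{ii'} \in \hat{\Delta}_{ii'} \pm \frac{t_{\alpha/2,\,n-1}\, V_{ii'}}{\sqrt{n}}. \qquad (8)$$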


The second problem we want to address is whether or not one algorithm is performing
significantly better than another. That is, we want to test the hypothesis that the true
means of the observations from two different algorithms are significantly different. Let
$f(t)$ be the probability density function (pdf) of the t distribution with $n - 1$ degrees of
freedom. Let $T(X_{i1}, \ldots, X_{in}, X_{i'1}, \ldots, X_{i'n})$ be the test statistic, which is a function of
the observations. For a given significance level $\alpha$, the corresponding hypothesis test can
be formulated as follows:

•  Null hypothesis: $H_0: \Delta_{ii'} = \mu_i - \mu_{i'} = 0$,

•  Alternative hypothesis: $H_a: \Delta_{ii'} = \mu_i - \mu_{i'} \neq 0$,

•  Test statistic:

$$T = T(X_{i1}, \ldots, X_{in}, X_{i'1}, \ldots, X_{i'n}) = \frac{\hat{\Delta}_{ii'} - 0}{V_{ii'}/\sqrt{n}} \qquad (9)$$

•  Approximate distribution of the test statistic $T$ under the null hypothesis $H_0$:
t distribution with $n - 1$ degrees of freedom

•  Rejection region for a significance level $\alpha$ test:

$$P_{val} = \int_{-\infty}^{-|T|} f(t)\,dt + \int_{|T|}^{\infty} f(t)\,dt \qquad (10)$$

Reject the null hypothesis $H_0$ if $P_{val} < \alpha$.


6.3  Advantages  of  the  Paired  Model  Analysis 

This paired test is valid even if $\sigma_i^2 \neq \sigma_{i'}^2$, where $i \neq i'$. We do not need to assume a
distribution for the observations $X_{ij}$. Since the correlation of observations on the same image
is considered, a variance of $(\sigma_i^2 + \sigma_{i'}^2 - 2\rho_{ii'}\sigma_i\sigma_{i'})/n$ is obtained for the estimator $\hat{\Delta}_{ii'}$ of
$\Delta_{ii'}$. This variance is smaller than that in the case where this correlation is ignored, i.e.,
the two samples $X_{i1}, \ldots, X_{in}$ and $X_{i'1}, \ldots, X_{i'n}$ are assumed to be independent. In the
latter case, since the two samples are independent, the variance of the estimator $\hat{\Delta}_{ii'}$ is given
by

$$\mathrm{Var}[\hat{\Delta}_{ii'}] = \frac{\sigma_i^2 + \sigma_{i'}^2}{n}. \qquad (11)$$

Since $\rho_{ii'} > 0$, it can be easily seen that

$$\frac{\sigma_i^2 + \sigma_{i'}^2 - 2\rho_{ii'}\sigma_i\sigma_{i'}}{n} < \frac{\sigma_i^2 + \sigma_{i'}^2}{n}.$$

In other words, a more precise estimate of $\Delta_{ii'}$ is obtained if we use the paired model.
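
The paired confidence interval and hypothesis test of Section 6.2 can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the experiments: the function name `paired_comparison` and its arguments are hypothetical, and the sketch uses the large-sample normal approximation of Equation (6) in place of the t quantile so that it needs only the standard library. For a sample as large as the 878-page test set the two should be close.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_comparison(x_i, x_k, alpha=0.05):
    """Compare two algorithms from per-page accuracies on the same pages.

    Returns (delta_hat, (ci_low, ci_high), t_stat, p_val).
    Assumption: normal approximation instead of the t distribution.
    """
    if len(x_i) != len(x_k) or len(x_i) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    n = len(x_i)
    # Paired differences W_j = X_ij - X_i'j, one per page.
    w = [a - b for a, b in zip(x_i, x_k)]
    delta_hat = mean(w)          # estimator of the true mean difference
    v = stdev(w)                 # sample standard deviation (n - 1 denominator)
    if v == 0:
        raise ValueError("all paired differences are identical")
    se = v / math.sqrt(n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (delta_hat - z * se, delta_hat + z * se)
    t_stat = delta_hat / se      # test statistic under H0: difference = 0
    p_val = 2 * (1 - NormalDist().cdf(abs(t_stat)))   # two-sided tail area
    return delta_hat, ci, t_stat, p_val
```

A call such as `paired_comparison(acc_a, acc_b)` takes two lists of per-page textline accuracies in the same page order; the null hypothesis is rejected when the returned `p_val` is below `alpha`.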

7  Page  Segmentation  Algorithms 

Page segmentation algorithms can be categorized into three classes: top-down approaches,
bottom-up approaches and hybrid approaches. Top-down algorithms start from the whole
document image and iteratively split it into smaller ranges. The splitting procedure stops
when some criterion is met and the obtained ranges constitute the final segmentation
results. Bottom-up algorithms start from document image pixels, and cluster the pixels
into connected components which are then clustered into words, lines or final zone
segmentations. If there are word clustering or line clustering procedures, the final zone
segmentations are obtained by clustering the words or lines. The commercial products
usually are “black-box” algorithms from which no algorithm structure information can
be inferred.

7.1  The  X-Y  Cut  Page  Segmentation  Algorithm 

The X-Y cut segmentation algorithm [25, 26] is a tree-based, top-down algorithm. The
root node of the tree represents the entire document page image $I$, an interior node
represents a rectangle on the page, and all the leaf nodes together represent the final
segmentation. While this algorithm is easy to implement, it can only work on document
pages with Manhattan layout and rectangular zones. The algorithm works as follows:

1. Create the horizontal and vertical prefix sum tables $H_X$ and $H_Y$ as follows:

$H_X[i][j] = \#\{p \in D(I) \mid X(p) = j,\ Y(p) < i,\ I(p) = 1\}$,

$H_Y[i][j] = \#\{p \in D(I) \mid X(p) < j,\ Y(p) = i,\ I(p) = 1\}$,

where $D(I) \subset \mathbb{Z}^2$ is the domain of the image $I$, $I(p)$ is the binary value of the
image at pixel $p$, and $X(p)$ and $Y(p)$ are the X and Y coordinates of the pixel $p$
respectively.

2. Initialize a tree with the entire document image as the root node. For each node
do the following:

(a) Compute the X and Y black pixel projection profile histograms of the current node
as follows:

$HIS_X[i] = H_X[Y_2(Z)][i] - H_X[Y_1(Z)][i]$,

$HIS_Y[j] = H_Y[j][X_2(Z)] - H_Y[j][X_1(Z)]$,

where $Z$ is the zone corresponding to the current node, and $(X_1(Z), Y_1(Z))$
and $(X_2(Z), Y_2(Z))$ are the upper-left and lower-right points of the zone.

(b) Shrink the current zone bounding box until it “tightly” encloses the
zone body. Noise removal thresholds $T_X^n$ and $T_Y^n$ are then used to classify
and remove background noise pixels. Since noise pixels in the background are
assumed to be distributed uniformly, the noise removal thresholds $T_X^n$ and $T_Y^n$
for a particular node are scaled linearly based on the current zone’s width and
height.

(c) Repeat step 2a.

(d) Obtain the widest zero valleys $V_X$ and $V_Y$ in the X and Y projection profile
histograms $HIS_X$ and $HIS_Y$.

(e) If $V_X > T_X^C$ or $V_Y > T_Y^C$, where $T_X^C$ and $T_Y^C$ are two width thresholds, split
at the mid-point of the wider of $V_X$ and $V_Y$ and generate two child nodes.
Otherwise, make the current node a leaf node.
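
The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation evaluated here: the noise removal of step 2b is omitted (only the tight shrinking is kept), zones are half-open boxes `(x1, y1, x2, y2)`, an explicit stack replaces the tree, and all function names are hypothetical. The image is a list of rows of 0/1 values, and the thresholds are assumed to be non-negative.

```python
def build_prefix_tables(img):
    """Step 1: column-wise (hx) and row-wise (hy) prefix sums of black pixels."""
    h, w = len(img), len(img[0])
    hx = [[0] * w for _ in range(h + 1)]        # hx[i][j]: column j, rows < i
    hy = [[0] * (w + 1) for _ in range(h)]      # hy[i][j]: row i, columns < j
    for i in range(h):
        for j in range(w):
            hx[i + 1][j] = hx[i][j] + img[i][j]
            hy[i][j + 1] = hy[i][j] + img[i][j]
    return hx, hy


def profiles(hx, hy, zone):
    """Step 2a: projection profiles of a half-open zone (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = zone
    his_x = [hx[y2][j] - hx[y1][j] for j in range(x1, x2)]
    his_y = [hy[i][x2] - hy[i][x1] for i in range(y1, y2)]
    return his_x, his_y


def shrink(hx, hy, zone):
    """Step 2b without noise removal: tighten the box around black pixels."""
    his_x, his_y = profiles(hx, hy, zone)
    cols = [k for k, v in enumerate(his_x) if v > 0]
    rows = [k for k, v in enumerate(his_y) if v > 0]
    if not cols:
        return None  # the zone contains no black pixels
    x1, y1 = zone[0], zone[1]
    return (x1 + cols[0], y1 + rows[0], x1 + cols[-1] + 1, y1 + rows[-1] + 1)


def widest_zero_valley(profile):
    """Step 2d: (width, start index) of the longest run of zeros."""
    best_w, best_start, run_start = 0, 0, None
    for k, v in enumerate(profile):
        if v != 0:
            run_start = None
            continue
        if run_start is None:
            run_start = k
        if k - run_start + 1 > best_w:
            best_w, best_start = k - run_start + 1, run_start
    return best_w, best_start


def xy_cut(img, t_x, t_y):
    """Return the leaf zones as a sorted list of half-open boxes."""
    hx, hy = build_prefix_tables(img)
    leaves, stack = [], [(0, 0, len(img[0]), len(img))]
    while stack:
        zone = shrink(hx, hy, stack.pop())
        if zone is None:
            continue
        x1, y1, x2, y2 = zone
        his_x, his_y = profiles(hx, hy, zone)
        v_x, s_x = widest_zero_valley(his_x)
        v_y, s_y = widest_zero_valley(his_y)
        # Step 2e: split at the midpoint of the wider valley, else make a leaf.
        if (v_x > t_x or v_y > t_y) and max(v_x, v_y) > 0:
            if v_x >= v_y:
                cut = x1 + s_x + v_x // 2
                stack += [(x1, y1, cut, y2), (cut, y1, x2, y2)]
            else:
                cut = y1 + s_y + v_y // 2
                stack += [(x1, y1, x2, cut), (x1, cut, x2, y2)]
        else:
            leaves.append(zone)
    return sorted(leaves)
```

Because each zone is shrunk before its valleys are measured, every zero valley is interior to the zone, so both children of a split are non-empty and strictly smaller than the parent.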

7.2  The  Docstrum  Page  Segmentation  Algorithm 

Docstrum [28] is a bottom-up page segmentation algorithm that can work on document
page images with non-Manhattan layout and arbitrary skew angles. However, this algorithm
only applies to the segmentation of text regions. Moreover, it does not perform well
when the document page image contains too many joined characters, and the estimates of
inter-character spacing, inter-line spacing and orientation angle become inaccurate when
document images contain sparse characters.

The  basic  steps  of  the  Docstrum  segmentation  algorithm  are  as  follows: 

1. Obtain connected components ($C_i$s) using a space-efficient two-pass algorithm [11].

2. Remove small and large noise or non-text connected components using low and
high thresholds $l$ and $h$.

3. Separate the $C_i$s into two groups, one with dominant characters and the other with
characters in titles and section headings. A parameter $f_d$ controls the clustering.

4. Find the $K$ nearest neighbors, $NN_K(i)$, of each $C_i$.

5. Compute the distance and angle of each $C_i$ and its $K$ nearest neighbors: $(\rho_j, \theta_j)$
such that $j \in NN_K(i)$.

6. Compute a within-line nearest-neighbor distance histogram from the following set
$W_p$: $W_p = \{\rho_j \mid j \in NN_K(i), \text{ and } -\theta_h < \theta_j < \theta_h\}$, where $\theta_h$ is the horizontal
angle tolerance threshold. Estimate the within-line inter-character spacing $cs$ as
the location of the peak in the histogram.

7. Compute a between-line nearest-neighbor distance histogram from the set $B_p$:
$B_p = \{\rho_j \mid j \in NN_K(i), \text{ and } 90° - \theta_v < \theta_j < 90° + \theta_v\}$, where $\theta_v$ is the vertical angle
tolerance threshold. Estimate the inter-line spacing $ls$ as the location of the peak
in the histogram.

8. Perform transitive closure on within-line nearest neighbor pairings to obtain textlines
$L_i$s using within-line nearest neighbor distance threshold $T_{cs} = f_t \cdot cs$.

9. Perform transitive closure on the $L_i$s to obtain structural blocks or zones $Z_i$s using
parallel distance threshold $T_{pa} = f_{pa} \cdot cs$ and perpendicular distance threshold
$T_{pe} = f_{pe} \cdot ls$. The parallel and perpendicular distances are computed as “end-end”
distance, not “centroid-centroid” distance.

In our implementation, we did not estimate orientation since all pages in the dataset
were deskewed. Furthermore, we used a resolution of 1 pixel/bin for constructing the
within-line and between-line histograms, and did not perform any smoothing of these
histograms.
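
Steps 4 to 7 can be illustrated with a short sketch that estimates the two spacings from component centroids. It is a hedged illustration rather than our implementation: the function name `estimate_spacings` is hypothetical, neighbor distances are assumed to be measured between centroids, the pages are assumed deskewed, and neighbor angles are folded into [0°, 90°] so that the symmetric tolerance tests become simple comparisons. As in the text, the histograms use 1 pixel/bin with no smoothing.

```python
import math
from collections import Counter


def estimate_spacings(centroids, k=5, theta_h=30.0, theta_v=30.0):
    """Return (cs, ls): peaks of the within-line and between-line histograms.

    centroids: list of (x, y) component centroids (assumed representation).
    Either value is None when its histogram is empty.
    """
    within, between = Counter(), Counter()
    for i, (xi, yi) in enumerate(centroids):
        # K nearest neighbors of component i by centroid distance.
        neighbors = sorted(
            (math.hypot(xj - xi, yj - yi), xj - xi, yj - yi)
            for j, (xj, yj) in enumerate(centroids)
            if j != i
        )[:k]
        for dist, dx, dy in neighbors:
            # Folded angle: 0 degrees is horizontal, 90 degrees is vertical.
            angle = math.degrees(math.atan2(abs(dy), abs(dx)))
            bin_index = round(dist)          # 1 pixel per bin
            if angle < theta_h:
                within[bin_index] += 1
            elif angle > 90.0 - theta_v:
                between[bin_index] += 1
    cs = max(within, key=within.get) if within else None
    ls = max(between, key=between.get) if between else None
    return cs, ls
```

The thresholds of steps 8 and 9 then follow by scaling, for example `t_cs = f_t * cs` and `t_pe = f_pe * ls`.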

7.3  The  Voronoi-Diagram-Based  Page  Segmentation  Algorithm 

Kise’s segmentation algorithm [19] is also a bottom-up algorithm based on the Voronoi
diagram. This method can work on document page images that have non-Manhattan
layout, arbitrary skew angles, or non-linear textlines. A set of connected line segments
is used to bound text zones. Since we evaluate all algorithms on document page images
with Manhattan layouts, this algorithm has been modified to generate rectangular zones.

This algorithm tends to fragment non-text regions (figures, tables and halftone images)
and text zones with irregular font sizes and spacings. It assumes that text regions are
dominant on a page and that the inter-character and inter-line spacing within a text
region are uniform. The algorithm steps are as follows:

1. Label connected components. A fast labeling procedure based on border following
is used. The 8-connected components and sample points on their borders are simultaneously
obtained from the input image. The algorithm parameter sr controls
the number of sample points used.

2. Remove noise connected components using the maximum noise zone size threshold
nm, the maximum width threshold $C_w$, the maximum height threshold $C_h$, and the maximum
aspect ratio threshold $C_r$ for all connected components.

3. Generate a Voronoi diagram. The Voronoi diagram for each connected component
is generated using the sample points on its border.

4. Delete superfluous Voronoi edges. These edges are deleted to obtain text zone
boundaries according to the following criteria. Let $E$ be the Voronoi edge between
two connected components $C_i$ and $C_j$ and let $d(E)$ be the minimum distance
between any two sample points from $C_i$ and $C_j$. Let $T_{d1}$ be the estimate of the
inter-character spacing, and let $T_{d2}$ be the estimate of the inter-line spacing plus a margin
controlled by a factor fr. Define $a_r(E)$ as the area ratio
$\max\{\mathrm{Area}(C_i), \mathrm{Area}(C_j)\} / \min\{\mathrm{Area}(C_i), \mathrm{Area}(C_j)\}$
and the threshold $T_a$ as the largest area ratio between the characters of the same
font and size. If a Voronoi edge satisfies $d(E)/T_{d1} < 1$ or $d(E)/T_{d2} + a_r(E)/T_a < 1$,
it is deleted. In this criterion, the first inequality indicates that the Voronoi edges
between $C_i$ and $C_j$ with a spacing smaller than the estimated inter-character spacing
are deleted, regardless of their area ratio. The second inequality implies that we
do not delete the Voronoi edge if 1) $C_i$ and $C_j$ come from different text zones that
have larger spacing than inter-line spacing plus a margin, or if 2) one is a character
and the other is a non-text object that has very different area from the character.

5. Remove noise zones using the minimum area threshold $A_z$ for all zones, and using the
minimum area threshold $A_l$ and the maximum aspect ratio threshold $B_r$ for the zones
that are vertical and elongated.

A C implementation of this algorithm was provided to us by Professor Koichi Kise.
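
The edge deletion rule of step 4 reduces to a small predicate. The sketch below is only an illustration of that rule, not part of the C implementation mentioned above; the function name and argument names are hypothetical, and the spacing estimates and the area ratio threshold are assumed to be given.

```python
def should_delete_edge(d, area_i, area_j, t_d1, t_d2, t_a):
    """Return True if the Voronoi edge between two components is superfluous.

    d       : minimum distance between sample points of the two components
    area_i/j: areas of the two connected components (must be positive)
    t_d1    : estimated inter-character spacing
    t_d2    : estimated inter-line spacing plus margin
    t_a     : largest area ratio expected within one font and size
    """
    if min(area_i, area_j) <= 0 or min(t_d1, t_d2, t_a) <= 0:
        raise ValueError("areas and thresholds must be positive")
    area_ratio = max(area_i, area_j) / min(area_i, area_j)
    # First test: closer than the inter-character spacing, delete regardless of
    # area. Second test: close enough and similar enough in area.
    return d / t_d1 < 1 or d / t_d2 + area_ratio / t_a < 1
```

For example, with `t_d1 = 5`, `t_d2 = 20` and `t_a = 40`, two similar-sized components 10 pixels apart lose their edge, while a character next to a component 40 times its area keeps it.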

7.4  Commercial  Segmentation  Algorithms 

Two commercial products, Caere’s segmentation algorithm [6] and ScanSoft’s segmentation
algorithm [39, 40], were selected for evaluation. They are representative state-of-the-art
commercial products. Both are black-box algorithms with no free parameters.

8  Experimental  Protocol 

In this section we provide the details of our experimental setup so that other researchers
can replicate our experiments. The experiment we conducted has a training phase and
a testing phase for the three research algorithms, and only a testing phase for the two
commercial products since they do not have user-specifiable free parameters. We used
textline accuracy as our performance metric. For each document page, we obtained a
performance metric value. We then computed an average performance metric value over
all document pages in the training dataset T or test dataset S and report it as the final
algorithm performance index.

In Section 8.1, we specify the dataset used in our experiments. In Section 8.2, we
describe the training procedure and specify the parameters for each research algorithm.
In Section 8.3, we briefly describe the testing procedure. In Section 8.4, we give the
details of the hardware and software environments.

8.1  Dataset  Specification 

We selected the University of Washington Dataset [31] for the performance evaluation
task since it is the only dataset that has textline-level groundtruth for each document
page. All pages in the dataset are journal pages from a large variety of journals in
diverse subject areas and from different publishers. The dataset also has geometric
textline and zone groundtruth for each page. The textline and zone groundtruth are
represented by non-overlapping rectangles. The University of Washington III dataset
has 1601 deskewed binary document images at 300 dpi resolution. We chose a subset
of 978 pages that correspond to the University of Washington I dataset pages as our
experimental dataset. We evaluated the chosen algorithms only on text regions since they
carry the most information about a document image. The non-text regions were ignored
in the evaluation process. We plan to extend our work to the evaluation of non-text
regions in the future. A training dataset T of 100 document pages was randomly sampled
from the selected 978 documents; the remaining 878 document pages are considered as
the test dataset S.
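
A mutually exclusive split of this kind can be sketched as follows. The function name, the page identifiers and the seed are illustrative assumptions; the actual random draw used in our experiments is not reproduced by this code.

```python
import random


def split_dataset(page_ids, n_train=100, seed=0):
    """Randomly draw n_train pages for training; the rest form the test set."""
    rng = random.Random(seed)            # fixed seed so the split is repeatable
    train = set(rng.sample(page_ids, n_train))
    test = [p for p in page_ids if p not in train]
    return sorted(train), test
```

With 978 page identifiers this yields 100 training pages and 878 test pages with no page in both sets.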

8.2  Algorithm  Training 

The parameters that a segmentation algorithm is sensitive to are automatically selected
by training the algorithm on the randomly selected 100-page training dataset T. A direct
search optimization procedure [27] is used to search for the optimal parameter values of
each algorithm. A starting point is necessary for the optimization procedure. Based on
information about the document page style, a reasonable working range can be selected for
each parameter of each algorithm. We chose a relatively conservative range to make sure
that the true optimum parameter values fell within the range. Six different starting points
within the reasonable working parameter subspace for each research algorithm were randomly
selected and the corresponding six convergence points were obtained. We then selected
the parameter values corresponding to the best of the six optima attained in the six
searches, i.e., the one with the lowest error rate (highest textline accuracy). In the
following sections, we specify the parameters that we optimized for each algorithm and the
corresponding reasonable working ranges. We fix the parameters that the algorithm is
insensitive to and only train the ones that the algorithm is sensitive to.
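
The multi-start procedure can be sketched as below. Note the substitution: for brevity the local search here is a simple compass (pattern) search, which is a direct search method but is not the simplex method used in our experiments. The names `compass_search` and `multi_start`, the step schedule and the seed are all illustrative assumptions; `f` stands for any objective to be minimized, such as the average error rate over the training dataset T.

```python
import random


def compass_search(f, x0, bounds, step_frac=0.25, tol=1e-3, max_evals=500):
    """Minimize f from x0 by axis-aligned direct search inside the bounds."""
    x, fx, evals = list(x0), f(x0), 1
    steps = [step_frac * (hi - lo) for lo, hi in bounds]
    while (max(s / (hi - lo) for s, (lo, hi) in zip(steps, bounds)) > tol
           and evals < max_evals):
        improved = False
        for d, (lo, hi) in enumerate(bounds):
            for sign in (1, -1):
                trial = list(x)
                trial[d] = min(hi, max(lo, x[d] + sign * steps[d]))
                if trial[d] == x[d]:
                    continue             # already on the boundary
                f_trial = f(trial)
                evals += 1
                if f_trial < fx:
                    x, fx, improved = trial, f_trial, True
                    break
        if not improved:
            steps = [s / 2 for s in steps]   # refine the mesh
    return x, fx, evals


def multi_start(f, bounds, n_starts=6, seed=0):
    """Run the local search from random starts and keep the lowest optimum."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_starts):
        x0 = [rng.uniform(lo, hi) for lo, hi in bounds]
        results.append(compass_search(f, x0, bounds))
    return min(results, key=lambda r: r[1])
```

Each element of `bounds` is a `(low, high)` working range for one parameter, mirroring the ranges listed in the following subsections.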

8.2.1  X-Y  Cut  Algorithm  Parameters 

The X-Y cut algorithm [25, 26] has four free parameters. Since the algorithm is very
sensitive to all four parameters, we searched for the optimal value for each of the four
parameters over the reasonable working ranges given below:

1. X widest zero valley width threshold $T_X^C$: {20-250 pixels};

2. Y widest zero valley width threshold $T_Y^C$: {20-200 pixels};

3. Vertical noise removal threshold $T_X^n$: {20-100 pixels};

4. Horizontal noise removal threshold $T_Y^n$: {20-100 pixels}.

Since in most cases the vertical cut is longer than the horizontal cut, we set the maximum
of $T_X^C$ to be larger than that of $T_Y^C$. Furthermore, since most inter-line gaps are less than
100 pixels, we set the maximum of $T_X^n$ and $T_Y^n$ to 100 pixels.

8.2.2  Docstrum  Algorithm  Parameters 

O’Gorman in his paper specifies eight parameters for the Docstrum algorithm [28]. We
introduced two additional parameters for textline segmentation control and character
grouping: 1) a superscript-subscript character distance threshold factor for correctly
handling textline segmentation, and 2) a character size ratio threshold to separate larger
characters from dominant characters. The algorithm is insensitive to six of the ten
parameters. We fixed these six parameters as follows: number of nearest connected components
for clustering, $K = 9$; low connected component size-threshold, $l = 2$ pixels; high
connected component size-threshold, $h = 200$ pixels; horizontal angle tolerance threshold,
$\theta_h = 30°$; vertical angle tolerance threshold, $\theta_v = 30°$; superscript and subscript
character distance threshold factor, $f_s = 0.4$. The values for the four parameters that the
algorithm is sensitive to were searched for in the reasonable working ranges given below:

1. Nearest neighbor threshold factor $f_t$: {1-5};

2. Parallel distance threshold factor $f_{pa}$: {2-10};

3. Perpendicular distance threshold factor $f_{pe}$: {0.5-5};

4. Character size ratio factor $f_d$: {2-10}.

8.2.3  Voronoi-Diagram-Based  Algorithm  Parameters 

Kise’s algorithm has eleven free parameters and is insensitive to seven of them. Six
of these eleven parameters are related to removing noise connected components and
blocks. The algorithm is also insensitive to a seventh parameter, sw. We fixed
the seven parameters as follows: maximum height and width thresholds of a connected
component, $C_h$ = 500 pixels and $C_w$ = 500 pixels; maximum connected component aspect
ratio threshold, $C_r$ = 5; minimum area threshold of a zone, $A_z$ = 50 pixels² for all zones;
and minimum area threshold, $A_l$ = 40000 pixels, and maximum aspect ratio threshold,
$B_r$ = 4, for the zones that are vertical and elongated. The last parameter is the size of
the smoothing window, which is fixed at sw = 2. The optimal values for the other four
parameters are searched for in the following ranges recommended by Kise:

1. Sampling rate sr: {4-7};

2. Maximum size threshold of noise connected component nm: {10-40};

3. Margin control factor for $T_{d2}$, fr: {0.01-0.5};

4. Area ratio threshold ta: {40-200}.

8.3  Algorithm  Testing 

All five algorithms were tested on the 878-page test dataset S. In order to be able to
compare the timing information for each algorithm, we tested all algorithms on the same
machine.

8.4  Hardware  and  Software  Environments 

In this section, we provide the details of the hardware and software environments used
for the implementation and training of the research algorithms and the testing of both
the research algorithms and the commercial products.

8.4.1  Implementation 

We implemented the X-Y cut and Docstrum algorithms based on [25, 26, 28]. The
platform used for the implementation was an Ultra 1 Sun workstation running the Solaris
2.6 operating system. The compiler used was GNU gcc 2.7.2. In our implementation
of the two research algorithms and the benchmarking algorithm, Release 1.0 of the DAFS
library developed by RAF Technology Inc. [35] was used. Kise [19] provided us with
an implementation of his Voronoi-based segmentation algorithm. We wrote programs in
the Visual C++ 5.0 environment to extract zone coordinates from the OCR output of
the two commercial products.

8.4.2  Training  Phase 

The machines we used for training were Ultra 1, 2 and 5 Sun workstations running the
Solaris 2.6 operating system. We used a direct search simplex algorithm to search for
a set of optimal parameter values for each research algorithm.

8.4.3  Testing  Phase 

In order to be able to compare the timing information for the algorithms, we tested
the research algorithms on a single machine, an Ultra 1 Sun workstation running the
Solaris 2.6 operating system. The CPU speed reported by the fpversion UNIX command
was 167 MHz. The two commercial products were tested on a Gateway PC with a 400 MHz
Pentium II CPU running the Windows 95 operating system. We normalized the PC timing to
UNIX timing using the relation $t_{UNIX} = 400 \cdot t_{PC}/167$ for comparison with the timing
of the research algorithms.
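
The normalization amounts to scaling by the ratio of the two clock speeds, as in the small sketch below. The function name is hypothetical, and scaling by clock speed alone is a rough proxy that ignores differences in architecture, compiler and operating system.

```python
def normalize_pc_time(t_pc, pc_mhz=400.0, unix_mhz=167.0):
    """Convert a time measured on the PC to equivalent time on the Sun machine."""
    return pc_mhz * t_pc / unix_mhz
```

A run that takes 1.67 seconds on the 400 MHz PC is thus reported as about 4 seconds of UNIX time.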

9  Experimental  Results  and  Discussion 

In this section, four aspects of the experimental results are reported: training, test,
statistical analysis and error analysis. In the training section, we report optimal parameter
values and training times, show the convergence curves, and discuss the convergence rate
for each research algorithm on the training dataset T. In the testing section, we report
the performance metric and timing results for all five algorithms on test dataset S.
Furthermore, a confidence interval is calculated for the results. In the statistical analysis
section, we compare the performance metric and timing of each possible algorithm pair
using the paired model. In the error analysis section, we report error analysis results
on three different error categories for each algorithm and identify the possible sources
of these errors for each research algorithm. Finally, based on the discussion in the error
analysis section, we provide some recommendations for users to choose appropriate page
segmentation algorithms for their purpose.

9.1  Training  Results 

Three research algorithms were trained on a 100-page training dataset T. Table 2, Table 3
and Table 4 report the optimum parameters, optimum performance index (textline accuracy)
value, and training time corresponding to each randomly selected starting point for
the X-Y cut, Docstrum and Voronoi algorithms respectively. We consider the parameter
values that give the lowest error rate as the set of optimal parameter values for each research
algorithm, as shown in Table 1. Figure 4, Figure 5 and Figure 6 show the convergence
characteristics for the X-Y cut, Docstrum and Voronoi algorithms respectively.

Table 1: Optimal parameter values for each research algorithm.

Algorithm   Optimal parameter values       Error rate   Function      Timing
                                           (percent)    evaluations   (hours)
X-Y cut     (78,32,35,54)                  14.71        86            12.70
Docstrum    (2.578,2.345,0.600,9.930)      5.00         108           7.00
Voronoi     (6,11,0.083,200)               4.73         52            6.99

The findings from the training results for each research algorithm are summarized as
follows:

1)  The  X-Y  cut  algorithm.  From  Table  2  and  Figure  4,  we  can  make  the  following 
observations: 


Figure  4:  Convergence  curves  corresponding  to  six  randomly  selected  starting  points  in 
the  training  of  the  X-Y  cut  algorithm. 


Table  2:  Optimization  results  of  the  X-Y  cut  algorithm  for  six  randomly  selected  starting 
points  within  a  reasonable  working  parameter  subspace. 


Starting parameter   Optimal parameter   Error rate   Number of function   Timing
values               values              (percent)    evaluations          (hours)
(140,80,50,70)       (128,62,23,91)      18.96        98                   25.88
(120,120,10,80)      (97,107,22,97)      18.57        70                   10.45
(80,40,70,50)        (82,64,21,89)       15.52        58                   11.32
(60,120,10,20)       (67,55,21,70)       14.96        92                   17.64
(100,80,100,50)      (100,79,100,49)     44.38        25                   4.33
(80,20,70,50)        (78,32,35,54)       14.71        86                   12.70

•  The error rates for all starting points (except one) converge in the range of 14.71%
to 18.96%,

•  The convergence rate before the first 20 function evaluations is much faster than
that beyond 20 function evaluations,

•  Most values of the X width threshold $T_X^C$ are larger than those of the Y width
threshold $T_Y^C$,

•  All values (except those of the outlier point) of the noise removal threshold $T_X^n$ are
smaller than those of the noise removal threshold $T_Y^n$,

•  Except for the unusual point, the variance of the number of function evaluations is
small,

•  There is a fair amount of variation in the optimal parameter values.

From the above observations, we can see that the X-Y cut algorithm objective function
has multiple local minima, and the performance at these local minima is not very stable.
The algorithm only needs about 20 function evaluations to reach stable performance.
The vertical cuts are generally longer than horizontal cuts. The vertical inter-zone gaps
are generally wider than horizontal inter-zone gaps.

2) Docstrum algorithm. From Table 3 and Figure 5, we can make the following
observations:


Figure  5:  Convergence  curves  corresponding  to  six  randomly  selected  starting  points  in 
the  training  of  the  Docstrum  algorithm. 


Table 3: Optimization results of the Docstrum algorithm for six randomly selected
starting points within a reasonable working parameter subspace.

Starting parameter   Optimal parameter             Error rate   Number of function   Timing
values               values                        (percent)    evaluations          (hours)
(1.0,4.0,1.5,4.0)    (2.327,2.344,0.597,5.223)     5.01         109                  15.08
(2.0,3.0,0.3,4.0)    (2.362,2.129,0.597,4.383)     5.44         75                   10.31
(5.0,4.0,0.3,3.0)    (3.072,2.138,0.302,3.602)     6.30         115                  12.74
(1.0,4.0,2.1,6.0)    (2.537,1.973,0.645,7.551)     5.34         102                  13.66
(3.0,4.0,3.0,7.0)    (2.578,2.345,0.600,9.930)     5.00         108                  7.00
(3.0,3.0,2.1,9.0)    (2.521,2.336,0.595,10.375)    5.00         139                  12.77

•  The error rates for all starting points except an outlier converge in the range of
5.00% to 6.30%,

•  The convergence rate before the first 50 function evaluations is much faster than
that beyond 50 function evaluations,

•  The parameters $f_t$, $f_{pa}$ and $f_{pe}$ converge to very similar values from all starting
points,

•  There is a large variation in the optimal values of parameter $f_d$,

•  The total number of function evaluations is larger than those for the X-Y cut and
Voronoi algorithms.

From the above observations, we can see that the performance of the algorithm stabilizes
after about 50 function evaluations, which is much larger than for the X-Y cut
and Voronoi algorithms. The performance of the Docstrum algorithm is insensitive to
large (> 5) values of parameter $f_d$. For small $f_d$, by contrast, more connected components
are grouped into the sparse connected component group, where the inter-character and
inter-line gap estimation is not accurate, and hence more errors occur. For the
other three parameters, the fact that the optimal values are very close implies that the
objective function may have a single “valley” in the neighborhood of these parameter values.

3) Kise’s Area-Voronoi-Diagram-Based algorithm. From Table 4 and Figure 6, we can
make the following observations:


Figure  6:  Convergence  curves  corresponding  to  six  randomly  selected  starting  points  in 
the  training  of  the  Voronoi  algorithm. 


Table 4: Optimization results of the Voronoi algorithm for six randomly selected starting
points within a reasonable working parameter subspace.

starting parameter    optimal parameter      error rate    number of function    timing
values                value                  (percent)     evaluations           (hours)

(6,25,0.1,80)         (6,15,0.084,108)       4.80          44                     8.77
(7,10,0.1,180)        (6,11,0.083,200)       4.73          52                     6.99
(6,30,0.3,60)         (6,11,0.146,149)       5.31          94                    24.11
(7,15,0.4,120)        (8,11,0.0977,191)      5.17          74                    19.17
(6,35,0.25,120)       (6,11,0.246,193)       5.52          66                     8.69
(4,25,0.05,140)       (4,11,0.134,161)       5.49          43                     7.17

•  The error rates for all starting points converge in the range of 4.73% to 5.52%,

•  The convergence rate during the first 20 function evaluations is much faster than
that beyond 20 function evaluations,

•  The value of parameter nm for most (five) starting points converges to 11 pixels,

•  There is a relatively small variance in the convergence values of parameters sr and
ta,

•  There is a relatively large variance in the convergence values of parameter fr,

•  There is a relatively large variance in the number of function evaluations
corresponding to the six starting points.

From the above observations, we can see that the Voronoi algorithm objective function
has multiple local minima, but the performance at these local minima is stable. The
algorithm needs only about 20 function evaluations to reach a stable performance. The
optimal algorithm performance is insensitive to the value of parameter fr. The fact that
the optimal value of parameter ta is large implies that the text and non-text connected
components are well separated. The fact that the values of parameter fr are generally
small indicates that we should choose a conservative (large) interline spacing threshold.
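
The multi-start training procedure discussed above can be illustrated with a small,
self-contained sketch. It is not the implementation used in the experiments: the
function `nelder_mead` is a textbook variant of the simplex method written for this
illustration, and `toy_error` is a hypothetical two-parameter stand-in for the training
error rate (the real objective runs a segmentation algorithm over the training dataset).
The toy objective has two local minima, so different starting points can converge to
different parameter values, as observed for the Voronoi algorithm.

```python
def nelder_mead(f, x0, step=0.5, max_evals=200, tol=1e-8):
    """Minimize f from x0; return (point, value, function evaluations)."""
    n = len(x0)
    evals = 0

    def call(x):
        nonlocal evals
        evals += 1
        return f(x)

    # Initial simplex: x0 plus one vertex offset along each axis.
    simplex = [list(x0)]
    for i in range(n):
        vertex = list(x0)
        vertex[i] += step
        simplex.append(vertex)
    values = [call(v) for v in simplex]

    while evals < max_evals:
        # Sort vertices from best (lowest error) to worst.
        order = sorted(range(n + 1), key=lambda k: values[k])
        simplex = [simplex[k] for k in order]
        values = [values[k] for k in order]
        if values[-1] - values[0] < tol:
            break
        worst = simplex[-1]
        centroid = [sum(v[i] for v in simplex[:-1]) / n for i in range(n)]

        def along(t):
            # Point on the line through the centroid and the worst vertex.
            return [centroid[i] + t * (worst[i] - centroid[i]) for i in range(n)]

        reflected = along(-1.0)
        f_ref = call(reflected)
        if f_ref < values[0]:
            # Reflection is the new best: try expanding further.
            expanded = along(-2.0)
            f_exp = call(expanded)
            if f_exp < f_ref:
                simplex[-1], values[-1] = expanded, f_exp
            else:
                simplex[-1], values[-1] = reflected, f_ref
        elif f_ref < values[-2]:
            simplex[-1], values[-1] = reflected, f_ref
        else:
            # Contract outside if the reflection beat the worst vertex, else inside.
            outside = f_ref < values[-1]
            contracted = along(-0.5 if outside else 0.5)
            f_con = call(contracted)
            if f_con < min(f_ref, values[-1]):
                simplex[-1], values[-1] = contracted, f_con
            else:
                # Shrink every vertex halfway toward the best one.
                best = simplex[0]
                for k in range(1, n + 1):
                    simplex[k] = [(best[i] + simplex[k][i]) / 2 for i in range(n)]
                    values[k] = call(simplex[k])

    k_best = min(range(n + 1), key=lambda k: values[k])
    return simplex[k_best], values[k_best], evals


def toy_error(p):
    # Hypothetical objective with two local minima in x and one in y.
    x, y = p
    return (x * x - 1.0) ** 2 + 0.2 * x + (y - 2.0) ** 2


starts = [(-2.0, 0.0), (2.0, 0.0), (0.5, 3.0), (-0.5, 1.0)]
results = [nelder_mead(toy_error, s) for s in starts]
best_point, best_value, _ = min(results, key=lambda r: r[1])
for start, (point, value, evals) in zip(starts, results):
    print(start, [round(c, 3) for c in point], round(value, 4), evals)
```

Keeping the result of every start, rather than only the best one, is what makes
observations such as those listed above possible: the spread of the converged values
and of the evaluation counts is visible directly in the printed rows.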

9.2  Testing  Results 

All five algorithms were tested on a 878-page test dataset S with their respective optimum
parameters. Table 5 reports the performance index (textline accuracy) and average
algorithm timing on the test dataset S. Figure 7 gives a bar-chart representation of the
testing results for each evaluated algorithm.

Table 5: Algorithm testing results and the corresponding 95% confidence intervals. The
average time per page is also reported. The times taken by the two commercial products
were normalized for the processor speed differences between the PC and the SUN.

Algorithm    Performance Index    Average Processing Time
             (percent)            (seconds)

Voronoi      94.64 ± 0.78         9.09 ± 0.18
Docstrum     94.11 ± 0.99         15.43 ± 0.32
X-Y cut      82.94 ± 1.61         6.37 ± 0.07
Caere        93.97 ± 0.85         2.02 ± 0.01, (normalized) 4.84 ± 0.04
ScanSoft     87.29 ± 1.35         3.13 ± 0.04, (normalized) 7.52 ± 0.10

From the testing results, we see that the Voronoi-based, Docstrum, and Caere algorithms
have similar performance indices, which are better than that of ScanSoft’s algorithm,
which in turn is better than that of the X-Y cut algorithm. Caere’s segmentation
algorithm has the least average processing time, whereas Docstrum has the greatest
average processing time. The connected component labeling method we used for Docstrum
may not be the optimum one, and hence its timing may be further improved.

For comparison purposes, an evaluator always likes to know whether the performance index
and processing time differences between algorithms are statistically significant,
especially for those algorithms with similar performance index values. This is addressed
in the following section.

[Figure 7, panel (a): Performance Index (text-line accuracy) by Segmentation Algorithm;
panel (b): Algorithm Timing by Segmentation Algorithm.]

Figure 7: The first three algorithms in the bar chart are research algorithms, and the last
two algorithms are commercial products. (a) shows the testing results of the performance
index (textline accuracy) for each algorithm. A 5% level t-test indicates that the
performances of Voronoi, Docstrum and Caere are not significantly different, but the three
are significantly better than ScanSoft, which in turn is significantly better than X-Y cut.
(b) shows the algorithm testing time results for each algorithm. A 5% level t-test
indicates that each algorithm’s timing is significantly different from that of any other
algorithm. From fastest to slowest, the algorithms are ranked as: Caere, X-Y cut,
ScanSoft, Voronoi and Docstrum.

9.3  Statistical  Analysis  of  Results 

We employed a paired model [18] to compare the performance index and testing time
differences between each possible algorithm pair, and then computed their confidence
intervals. The analysis results for performance index and processing time are reported
in matrix form in Table 6 and Table 7 respectively. If we denote by $T_{ij}$ the value of
the table cell in the $i$th row and $j$th column, then $T_{ij} = a_i - a_j$, where $a_i$
is the performance index (processing time) value of the algorithm in the $i$th row, and
$a_j$ is the performance index (processing time) value of the algorithm in the $j$th
column. Note that the normalized processing time is used for the two commercial products.
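
A paired comparison of this kind can be sketched as follows. This is an illustration
under stated assumptions rather than the exact model of [18]: the function
`paired_analysis` is a hypothetical helper, the per-page scores are synthetic, and a
large-sample normal approximation is used in place of a t distribution (reasonable for
several hundred pages). The p-value computed here is two-sided; which convention the
Pval entries of Tables 6 and 7 follow is not stated in this section.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_analysis(scores_a, scores_b, alpha=0.05):
    """Return (mean difference, CI half-width, two-sided p-value).

    scores_a[k] and scores_b[k] must be measured on the same page k.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    d_bar = mean(diffs)
    std_err = stdev(diffs) / math.sqrt(n)
    # Normal quantile for a (1 - alpha) two-sided interval (an approximation).
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * std_err
    if std_err == 0:
        p_value = 1.0 if d_bar == 0 else 0.0
    else:
        p_value = 2.0 * (1.0 - NormalDist().cdf(abs(d_bar) / std_err))
    return d_bar, half_width, p_value


# Synthetic per-page accuracies (percent) for three hypothetical algorithms.
pages = range(200)
algo_a = [90 + (k % 7) for k in pages]
algo_b = [a - 3 + ((k % 5) - 2) for k, a in zip(pages, algo_a)]  # about 3 points worse
algo_c = [a + ((k % 5) - 2) for k, a in zip(pages, algo_a)]      # same on average

diff_ab, half_ab, p_ab = paired_analysis(algo_a, algo_b)
diff_ac, half_ac, p_ac = paired_analysis(algo_a, algo_c)
print(f"a - b = {diff_ab:.2f} ± {half_ab:.2f}, p = {p_ab:.3g}")
print(f"a - c = {diff_ac:.2f} ± {half_ac:.2f}, p = {p_ac:.3g}")
```

Because the differences are taken page by page, variation that is common to both
algorithms on a given page cancels out, which is why the interval on the difference can
be much tighter than the two per-algorithm intervals would suggest.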

From Table 6, we find that the differences between the performance indices of Kise’s
algorithm, Caere’s segmentation algorithm, and Docstrum are not statistically significant,
but they are significantly better than those of ScanSoft’s segmentation algorithm and
the X-Y cut algorithm. Moreover, the performance index of ScanSoft’s segmentation
algorithm is significantly better than that of the X-Y cut algorithm. From Table 7, we
find that the processing times of all algorithms differ significantly from one another.
From the fastest processing time to the slowest processing time, the algorithms are ranked
as Caere’s segmentation algorithm, X-Y cut, ScanSoft’s segmentation algorithm, Kise’s
algorithm, and Docstrum. For Docstrum, a better connected component labeling algorithm
might improve its timing performance.
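
The grouping read off Table 6 can be checked mechanically. The sketch below takes the
differences and 95% half-widths reported in Table 6 and treats a difference as
significant when its interval excludes zero, which is assumed here to correspond to
the (*) marks at the 0.05 level. The helper names `is_significant` and `tied_groups`
are hypothetical. Algorithms linked by non-significant differences are placed in the
same group; note that such linking is not guaranteed to be meaningful in general, since
non-significance is not transitive, but it happens to give clean groups for this table.

```python
# (row algorithm, column algorithm): (difference, 95% CI half-width), from Table 6.
TABLE_6 = {
    ("Voronoi", "Caere"): (0.66, 1.17),
    ("Voronoi", "Docstrum"): (0.52, 1.23),
    ("Voronoi", "ScanSoft"): (7.33, 1.55),
    ("Voronoi", "X-Y cut"): (11.69, 1.80),
    ("Caere", "Docstrum"): (-0.13, 1.09),
    ("Caere", "ScanSoft"): (6.67, 1.38),
    ("Caere", "X-Y cut"): (11.04, 1.67),
    ("Docstrum", "ScanSoft"): (6.81, 1.59),
    ("Docstrum", "X-Y cut"): (11.18, 1.79),
    ("ScanSoft", "X-Y cut"): (4.36, 1.87),
}


def is_significant(diff, half_width):
    """True when the interval diff ± half_width excludes zero."""
    return abs(diff) > half_width


def tied_groups(table):
    """Group algorithms that are linked by non-significant differences."""
    names = []
    for pair in table:
        for name in pair:
            if name not in names:
                names.append(name)
    group_of = {name: {name} for name in names}
    for (a, b), (diff, half_width) in table.items():
        if not is_significant(diff, half_width) and group_of[a] is not group_of[b]:
            merged = group_of[a] | group_of[b]
            for name in merged:
                group_of[name] = merged
    groups = []
    for name in names:
        if group_of[name] not in groups:
            groups.append(group_of[name])
    return groups


for group in tied_groups(TABLE_6):
    print(sorted(group))
```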
Table 6: Paired model statistical analysis results on the difference between a pair of
performance indexes (in percent) and the corresponding 95% confidence intervals. A (*)
indicates that the difference is statistically significant at α = 0.05, and no (*) indicates
that the difference is not significant. We see that there is no significant difference between
the Voronoi, Docstrum and Caere algorithms. However, this group is significantly better
than ScanSoft, which in turn is better than X-Y cut.

            Caere               Docstrum            ScanSoft                  X-Y cut

Voronoi     0.66 ± 1.17         0.52 ± 1.23         7.33 ± 1.55 (*)           11.69 ± 1.80 (*)
            Pval = 0.13         Pval = 0.20         Pval = 9.02E-20           Pval = 2.29E-34

Caere       -                   -0.13 ± 1.09        6.67 ± 1.38 (*)           11.04 ± 1.67 (*)
                                Pval = 0.40         Pval = 1.26E-20           Pval = 1.21E-35

Docstrum    -                   -                   6.81 ± 1.59 (*)           11.18 ± 1.79 (*)
                                                    Pval = [illegible]E-17    Pval = 4.61E-32

ScanSoft    -                   -                   -                         4.36 ± 1.87 (*)
                                                                              Pval = 2.79E-06

Table 7: Paired model statistical analysis results on the difference in processing times
(seconds) and the corresponding 95% confidence intervals. A (*) indicates the difference
is statistically significant at α = 0.05 and no (*) implies the difference is not significant.
We see that from the least to the greatest average processing time, the algorithms are
ranked as: Caere, X-Y cut, ScanSoft, Voronoi and Docstrum.

            Caere               Docstrum             ScanSoft            X-Y cut

Voronoi     4.25 ± 0.16 (*)     -6.35 ± 0.18 (*)     1.57 ± 0.20 (*)     2.72 ± 0.13 (*)
            Pval = 0            Pval = 0             Pval = 1.91E-48     Pval = 0

Caere       -                   -10.59 ± 0.29 (*)    -2.68 ± 0.10 (*)    -1.53 ± 0.05 (*)
                                Pval = 0             Pval = 0            Pval = 0

Docstrum    -                   -                    7.91 ± 0.32 (*)     9.06 ± 0.26 (*)
                                                     Pval = 0            Pval = 0

ScanSoft    -                   -                    -                   1.15 ± 0.12 (*)
                                                                         Pval = 1.06E-66

9.4  Error  Analysis 

Error analysis is crucial to interpreting the functionalities of the evaluated algorithms.
Each algorithm has different weaknesses. Figure 8 shows the error analysis results of
three error types for each algorithm.

We can see that among the research algorithms, X-Y cut has a much larger split
textline error rate than the Voronoi and Docstrum algorithms. This is mainly due to
the fact that the two zone cut thresholds (or widest zero valley thresholds) $T_X^C$ and
$T_Y^C$ and the two noise removal thresholds $T_X^n$ and $T_Y^n$ are global thresholds
whose values are the same for every document image, whereas in the Voronoi and Docstrum
algorithms, the inter-character and inter-line spacings are estimated for each individual
document image. Titles with wide inter-character and inter-word spacings, numbered text
lists and textlines with irregular character spacings in some document images make the
spacing parameter estimation inaccurate in both the Voronoi-based and Docstrum
algorithms, and hence contribute to the split textline error rates in these two
algorithms. However, these error rates are much smaller than that of X-Y cut.

Among the research algorithms, X-Y cut also has the largest horizontally merged textline
error rate, Docstrum has the second highest, and Voronoi has the lowest. This occurs
primarily for the following reasons: 1) There are pages that have “L”-shaped thick, long
noise blocks at the edges, which cannot be cut through in either the X or Y direction by
the noise removal thresholds $T_X^n$ and $T_Y^n$ of the X-Y cut algorithm, so that many
text regions under these noise blocks are merged together. 2) In Docstrum’s
implementation, the huge noise blocks encountered by the X-Y cut algorithm are filtered
out in a preprocessing step, so that they do not affect the connected component and
textline clustering procedures. Also, Docstrum estimates inter-character and inter-line
spacing for each individual document image. 3) In the Voronoi-based algorithm’s
implementation, in addition to what has been done for Docstrum, Kise uses not only the
spacing of the connected components but also their area ratios to generate zone
boundaries. Hence a few lines or noise blocks between text regions do not cause
horizontal merges, whereas they do cause horizontal merges in the Docstrum algorithm.
This is the main reason why Docstrum has more horizontally merged textlines than the
Voronoi-based algorithm.

Finally, among the research algorithms, X-Y cut has the highest mis-detection error rate,
while Voronoi and Docstrum have negligible error rates. This is again due to the global
thresholds of the X-Y cut algorithm, which cause textlines such as headers, footers,
authors and page numbers that are not aligned with text blocks to be considered as noise
regions and hence not to be detected.

9.5  Recommendations 

Based  on  the  discussion  in  the  last  section,  we  feel  that  some  recommendations  may 
be  useful  to  users  who  can  make  a  choice  among  page  segmentation  algorithms.  We 
summarize  our  recommendations  about  the  three  research  algorithms  as  follows: 

•  For segmentation of document pages with large skew angles or large noise blocks
(especially “L”-shaped, “U”-shaped or closed-shape thick noise bars), the X-Y cut
algorithm is a poor choice.

•  For segmentation of document pages with lines separating zones, the Voronoi-based
algorithm is a better choice than either the Docstrum or X-Y cut algorithm.

•  For segmentation of document images with a large font size range, irregular
inter-character or inter-line spacing, few noise blocks, and negligible skew angles, X-Y
cut is a better choice than either the Voronoi-based or Docstrum algorithms.

•  The Voronoi-based algorithm is preferred over the Docstrum algorithm in general.

•  For the X-Y cut algorithm, first remove large noise blocks by labeling connected
components and then removing the larger ones.
[Figure 8, panels: (a) Split Text-line Error; (b) Horizontally Merged Text-line Error;
(c) Mis-detected Text-line Error.]

Figure 8: This figure shows three types of errors. (a) shows the page error rate as the
ratio of the number of groundtruth textlines whose bounding boxes are split to the total
number of groundtruth textlines. We denote this error category as split error. A 5%
level t-test indicates that the error rates of ScanSoft and X-Y cut are not significantly
different, but they are significantly higher than those of the other three algorithms.
Moreover, the error rates of Voronoi, Docstrum and Caere are significantly different from
each other. (b) shows the page error rate as the ratio of the number of groundtruth
textlines that are horizontally merged to the total number of groundtruth textlines.
We denote this error category as horizontal merge error. A 5% level t-test indicates
that the error rates of all five algorithms are significantly different from each other.
(c) shows the page error rate as the ratio of the number of groundtruth textlines that
are missed to the total number of groundtruth textlines. We denote this error category
as mis-detection error. The rate of this error type is much smaller than those of the
other two error types. From the lowest to the highest error rate, the algorithms are
ranked as: Docstrum, Voronoi, Caere, X-Y cut and ScanSoft.
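
The three ratios defined in the Figure 8 caption can be computed as in the following
minimal sketch. It assumes that each groundtruth textline has already been labeled as
split, horizontally merged or missed by some matching step that is not shown here;
the class `TextlineOutcome` and the function `error_rates` are hypothetical names
introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class TextlineOutcome:
    """Outcome flags for one groundtruth textline (assumed to be given)."""
    split: bool = False
    merged: bool = False   # horizontally merged with another textline
    missed: bool = False   # not detected at all


def error_rates(outcomes):
    """Return each error type as a fraction of all groundtruth textlines."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("need at least one groundtruth textline")
    return {
        "split": sum(o.split for o in outcomes) / total,
        "horizontal_merge": sum(o.merged for o in outcomes) / total,
        "mis_detection": sum(o.missed for o in outcomes) / total,
    }


# A made-up page with ten groundtruth textlines: two split, one merged.
page = [TextlineOutcome() for _ in range(7)]
page += [TextlineOutcome(split=True), TextlineOutcome(split=True)]
page += [TextlineOutcome(merged=True)]
rates = error_rates(page)
print(rates)
```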


10  Conclusions 

We have proposed a five-step performance evaluation methodology for evaluating page
segmentation algorithms: 1) First we randomly partition the dataset D into a mutually
exclusive training dataset T and test dataset S with both textline-level and zone-level
groundtruth, 2) we then define textline accuracy as the performance metric, 3) the
Nelder-Mead simplex search algorithm is then used to search automatically for the optimal
parameter values of the segmentation algorithms, 4) the segmentation algorithms are
then evaluated on the test dataset S using their corresponding optimal parameter values,
and finally 5) a paired-model statistical analysis and an error analysis are performed to
provide confidence intervals and to test hypotheses regarding the performance indices
and algorithm timings. The errors of three research algorithms were analyzed in terms of
mis-detection, split and horizontal merge error types. We found that the performances of
the Voronoi, Docstrum and Caere segmentation algorithms are not significantly different
from one another, but they are significantly better than that of ScanSoft’s segmentation
algorithm, which in turn is significantly better than that of X-Y cut. We also found
that the timings of the algorithms are significantly different from one another. From the
fastest to the slowest, the algorithms are ranked as Caere, X-Y cut, ScanSoft, Voronoi
and Docstrum. In the error analysis, we found that X-Y cut has the most split and
horizontal merge errors due to its global thresholds, Voronoi has the fewest horizontal
merge errors due to its use of area ratio information of connected components, and
Caere has the least split error. We intend to extend this work to evaluation of tables,
graphs and half-tone images.